This document explores the Sydney Airbnb dataset. The data comes from Kaggle.
# import all packages and set plots to be embedded inline
# packages for data processing and visualization
import copy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# packages for plotting on maps
import geopandas as gpd
import descartes
from shapely.geometry import Point, Polygon
import contextily as ctx
import geoplot as gplt
# packages for linear regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
# packages for text process
import re
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
%matplotlib inline
# load the listing summary dataset
sl_df = pd.read_csv('sydney_listings_summary_dec18.csv')
print(sl_df.shape)
sl_df.head()
print(sl_df.dtypes)
sl_df.isnull().sum()
# convert 'last_review' datatype to datetime
sl_df.last_review = pd.to_datetime(sl_df.last_review)
# drop neighbourhood_group because all of its values are null
sl_df.drop(columns = ['neighbourhood_group'], inplace = True)
# load the calendar summary dataset
sc_df = pd.read_csv('sydney_calendar_dec18.csv')
print(sc_df.shape)
sc_df.head()
print(sc_df.dtypes)
sc_df.isnull().sum()
# convert date column to datetime64
sc_df.date = pd.to_datetime(sc_df.date)
# change the available column values to 1 and 0
sc_df.available = sc_df.available.apply(lambda x : 1 if x == 't' else 0)
# convert price from string to float
def getnum(price):
    '''
    INPUT:
    price - a string price starting with '$', possibly containing ',',
            or a number
    OUTPUT:
    price - a numeric string with '$' and ',' stripped,
            or the original number unchanged
    '''
    if type(price) != str:
        return price
    else:
        return ''.join(price[1:].split(','))
sc_df.price = sc_df.price.apply(getnum)
sc_df.price = sc_df.price.apply(float)
# loading sydney_reviews_dec18
sr_df = pd.read_csv('sydney_reviews_dec18.csv')
print(sr_df.shape)
sr_df.head()
print(sr_df.dtypes)
sr_df.isnull().sum()
# convert date datatype
sr_df.date = pd.to_datetime(sr_df.date)
sl_df
Sydney Airbnb Open Data contains 6 datasets and 1 GeoJSON file. I will use 3 of the datasets plus the GeoJSON file.
The first dataset, sl_df (short for 'Sydney listings dataframe'), contains 36,663 records with 15 features. It has a unique key ('id'); most of the features are numeric, including two spatial features ('latitude' and 'longitude'), one datetime feature, and several ordinary numeric features. It also has two important categorical features, 'neighbourhood' and 'room_type'.
The second dataset, sc_df ('Sydney calendar dataframe'), contains 13,381,265 records, roughly 365 times the number of records in the first dataset. It has 4 features (listing_id, date, available and price), which I will use for a time series analysis.
The third dataset, sr_df ('Sydney reviews dataframe'), contains 446,708 rows; I will use its comments column.
The last two datasets both have a foreign key that matches the primary key of the first dataset, so I can join them on this id key.
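As a quick sketch of that join (using tiny toy frames in place of the real CSVs), merging the reviews onto the listings looks like:

```python
import pandas as pd

# toy stand-ins: listings has the primary key 'id', reviews the foreign key 'listing_id'
listings = pd.DataFrame({'id': [1, 2], 'neighbourhood': ['Sydney', 'Waverley']})
reviews = pd.DataFrame({'listing_id': [1, 1, 2], 'comments': ['great', 'clean', 'nice']})

# inner join: each review row picks up the attributes of its listing
joined = reviews.merge(listings, left_on='listing_id', right_on='id')
print(joined.shape)  # (3, 4)
```

The same pattern is used later to attach the neighbourhood column to sr_df.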
# create a new column: if the owner has only this one room, mark it as Low;
# if the owner has 2 to 4 rooms ('calculated_host_listings_count' is 2 to 4), mark it as Medium;
# if the owner has more than 4 rooms, mark it as High
def L_to_H(count):
    '''
    INPUT:
    count - positive int
    OUTPUT:
    str - 'Low' if count is 1,
          'Medium' if count is between 2 and 4,
          'High' if count is greater than or equal to 5
    '''
    if count == 1:
        return 'Low'
    if count >= 2 and count <= 4:
        return 'Medium'
    if count >= 5:
        return 'High'
sl_df['owner_own_rooms'] = sl_df['calculated_host_listings_count'].apply(L_to_H)
# set the new column to ordered categorical type
order = ['Low','Medium','High']
ordered_var = pd.api.types.CategoricalDtype(ordered = True, categories = order)
sl_df['owner_own_rooms'] = sl_df['owner_own_rooms'].astype(ordered_var)
# take a look at the univariate distributions of 3 features (neighbourhood, room_type and calculated_host_listings_count)
base_color = sns.color_palette()[0]
plt.figure(figsize = [8,10])
plt.subplot(2,2, 1)
sns.countplot(data = sl_df, x = 'room_type',color = base_color,
order = sl_df.room_type.value_counts().index)
plt.title('Room Type')
plt.xlabel('')
plt.xticks(rotation = 10)
for i in range(3):
    count = sl_df.room_type.value_counts()[i]
    pct_string = '{:.1%}'.format(count/sl_df.shape[0])
    plt.text(i-0.25, count+200, pct_string, va = 'baseline', fontsize = 10);
plt.subplot(2, 2, 2)
sns.countplot(data = sl_df, x = 'owner_own_rooms',color = base_color,
order =sl_df.owner_own_rooms.value_counts().index)
plt.title('Calculated Host Listings Count')
plt.xlabel('')
plt.ylabel('')
plt.yticks([])
for i in range(3):
    count = sl_df.owner_own_rooms.value_counts()[i]
    pct_string = '{:.1%}'.format(count/sl_df.shape[0])
    plt.text(i-0.25, count+200, pct_string, va = 'baseline', fontsize = 10);
plt.subplot(2, 2, (3,4))
sns.countplot(data = sl_df, x = 'neighbourhood',color = base_color,
order = sl_df.neighbourhood.value_counts().index)
plt.xticks(rotation = 90)
plt.xlabel('Neighbourhood',fontsize = 12)
plt.show();
The vast majority of rooms are entire homes or private rooms; only 1.7% are shared rooms.
Most Airbnb properties (63.9%) are owned by hosts with a single listing. The other 36.1% belong to hosts with at least 2 listings.
The Sydney, Waverley and Randwick neighbourhoods have the most rooms.
## room types vs host count
plt.figure(figsize = [8, 5])
sns.countplot(data = sl_df, x = 'owner_own_rooms', hue = 'room_type', palette = 'Blues')
plt.xlabel('Calculated Host Listings Count')
plt.title('The count of rooms by room type and host type')
plt.show();
More than 8,000 hosts have only one listing, and that listing is a private room: they are sharing one room of their own home to run the business.
## host count vs room type
plt.figure(figsize = [8, 5])
sns.countplot(data = sl_df, x = 'room_type', hue = 'owner_own_rooms', palette = 'Greens',
order = sl_df.room_type.value_counts().index)
plt.xlabel('Room Types')
plt.show();
# neighbourhood vs room type
plt.figure(figsize = [8, 12])
sns.countplot(data = sl_df, y = 'neighbourhood', hue = 'room_type', palette = 'Blues',
order = sl_df.neighbourhood.value_counts().index)
plt.show();
# neighbourhood vs host count
plt.figure(figsize = [8, 12])
sns.countplot(data = sl_df, y = 'neighbourhood', hue = 'owner_own_rooms', palette = 'Blues',
order = sl_df.neighbourhood.value_counts().index)
plt.show();
This dataset contains spatial information, so I'd like to plot the data on a map using geopandas. First, I want to plot a heat map of the average price of each neighbourhood.
# read neighbourhoods.geojson from the zip file, and set the CRS
area = gpd.read_file('neighbourhoods.geojson')
area.crs = {'init' :'epsg:4326'};
area.shape
# add an average price column to the geopandas dataset
price_area = area.merge(sl_df.groupby('neighbourhood').price.mean().reset_index(),on = 'neighbourhood' )
price_area
gplt.choropleth(price_area, hue='price', edgecolor='black', linewidth=0.1,
cmap='Greens', legend=True, scheme='Percentiles', figsize = (10,10))
plt.title('Average Airbnb Price of each neighbourhood($)')
plt.show();
Overall, average Airbnb prices in neighbourhoods near the water (sea or bay) are higher than in neighbourhoods away from it.
The Pittwater area has the highest average price. This neighbourhood is famous for its sea views and large houses; since there are almost no apartments in the area, the supply of available houses is relatively limited. These factors push Airbnb prices up.
Neighbourhoods around Sydney Harbour have the second-highest prices, as these areas are well-known tourist destinations.
Next, I want to plot the houses on the map based on their latitude and longitude. There is a problem: the polygon data in the neighbourhoods.geojson file is relatively coarse and does not have enough granularity, so I found another .shp file with higher granularity from this link
# load .shp file
aus_poas = gpd.read_file('aus_poas.shp')
aus_poas.head()
# filter out the polygon data for NSW, and set the CRS to 'epsg:4326'
syd_poas = aus_poas.query('code>= 2000 & code <= 3000 ')
syd_poas.crs = {'init' :'epsg:4326'};
# add the geometry feature to sl_df, and set the CRS
geometry = [Point(xy) for xy in zip(sl_df['longitude'],sl_df['latitude'])]
crs = {'init':'epsg:4326'}
geo_df = gpd.GeoDataFrame(sl_df,crs = crs, geometry = geometry)
geo_df.head()
# plot all houses on the map
fig,ax = plt.subplots(figsize = (15,15))
syd_poas.plot(ax = ax, alpha = 0.3,color= 'white', edgecolor='gray')
plt.scatter(data = geo_df, x = 'longitude', y = 'latitude', s =2, alpha= 0.5)
plt.xlim(150.75,151.35)
plt.ylim(-34.1,-33.58)
plt.xticks([])
plt.yticks([])
plt.title('Location of Airbnb')
# add a basemap for a clearer view
ctx.add_basemap(ax, crs=aus_poas.crs,zoom=12);
The vast majority of Airbnb properties are located along the coast and around the Sydney Harbour waters. In inner Sydney, Airbnb locations follow the train lines, such as T1 (from North Sydney to Hornsby) and T2 (from Central to Parramatta).
# narrow the xlim and ylim to get a clearer view of the Sydney Harbour area
fig,ax = plt.subplots(figsize = (15,15))
syd_poas.plot(ax = ax, alpha = 0.3,color= 'white', edgecolor='gray')
plt.scatter(data = geo_df, x = 'longitude', y = 'latitude', s = 1.5,alpha= 0.5)
plt.xlim(151.125,151.325)
plt.ylim(-33.925,-33.755)
plt.xticks([])
plt.yticks([])
plt.title('Airbnb rooms near Sydney Harbour Water')
ctx.add_basemap(ax, crs=aus_poas.crs,zoom=12);
Focusing on the core area of Sydney, Airbnb properties are highly clustered in three areas: around the CBD, around Bondi Beach and around Manly Beach. These are the most popular parts of Sydney for tourists and even for locals, and they have a lot of apartments.
# get 3 subplots on the room type
fig,ax = plt.subplots(3,1,figsize = (15,20))
syd_poas.plot(ax = ax[0], alpha = 0.3,color= 'white', edgecolor='k')
ax[0].scatter(data = geo_df[geo_df['room_type'] == 'Entire home/apt'], x = 'longitude',
y = 'latitude', s = 3,alpha= 0.5)
ax[0].set_xlim(150.55,151.4)
ax[0].set_ylim(-33.96,-33.62)
ctx.add_basemap(ax[0], crs={'init': 'epsg:4326'}, zoom=12)
ax[0].set_title('Airbnb of Entire home or apartment')
ax[0].set_xticks([])
ax[0].set_yticks([]);
syd_poas.plot(ax = ax[1], alpha = 0.3,color= 'white', edgecolor='k')
ax[1].scatter(data = geo_df[geo_df['room_type'] == 'Private room'], x = 'longitude',
y = 'latitude', c= 'red',s = 3,alpha= 0.5)
ax[1].set_xlim(150.55,151.4)
ax[1].set_ylim(-33.96,-33.62)
ctx.add_basemap(ax[1], crs={'init': 'epsg:4326'}, zoom=12)
ax[1].set_title('Airbnb of Private room')
ax[1].set_xticks([])
ax[1].set_yticks([]);
syd_poas.plot(ax = ax[2], alpha = 0.3,color= 'white', edgecolor='k')
ax[2].scatter(data = geo_df[geo_df['room_type'] == 'Shared room'], x = 'longitude',
y = 'latitude', c= 'green',s = 3,alpha= 0.5)
ax[2].set_xlim(150.55,151.4)
ax[2].set_ylim(-33.96,-33.62)
ctx.add_basemap(ax[2], crs={'init': 'epsg:4326'}, zoom=12);
ax[2].set_title('Airbnb of Shared room')
ax[2].set_xticks([])
ax[2].set_yticks([]);
We can see entire homes/apartments are mostly located in the central city area or along the seaside; except for some clustering in Parramatta, most of them are on the east side of the city.
Private rooms are more widely distributed, but those located in Pittwater (the area with the highest average Airbnb price, as analysed above) are far fewer than entire homes/apartments.
As for shared rooms, because their quantity is so small, they are only distributed around the CBD and the Bondi Beach area.
# finally, get 3 subplots split by how many Airbnb properties the owner has
fig,ax = plt.subplots(3,1,figsize = (15,20))
syd_poas.plot(ax = ax[0], alpha = 0.3,color= 'white', edgecolor='k')
ax[0].scatter(data = geo_df[geo_df['owner_own_rooms'] == 'Low'], x = 'longitude', y = 'latitude', s = 1,alpha= 0.5)
ax[0].set_xlim(150.55,151.4)
ax[0].set_ylim(-33.96,-33.62)
ctx.add_basemap(ax[0], crs={'init': 'epsg:4326'}, zoom=12)
ax[0].set_title('Airbnb locations whose owner has only 1 room')
ax[0].set_xticks([])
ax[0].set_yticks([]);
syd_poas.plot(ax = ax[1], alpha = 0.3,color= 'white', edgecolor='k')
ax[1].scatter(data = geo_df[geo_df['owner_own_rooms'] == 'Medium'], x = 'longitude', y = 'latitude', c= 'red',s = 1,alpha= 0.5)
ax[1].set_xlim(150.55,151.4)
ax[1].set_ylim(-33.96,-33.62)
ctx.add_basemap(ax[1], crs={'init': 'epsg:4326'}, zoom=12)
ax[1].set_title('Airbnb locations whose owner has 2-4 rooms')
ax[1].set_xticks([])
ax[1].set_yticks([]);
syd_poas.plot(ax = ax[2], alpha = 0.3,color= 'white', edgecolor='k')
ax[2].scatter(data = geo_df[geo_df['owner_own_rooms'] == 'High'], x = 'longitude', y = 'latitude', c= 'green',s = 1,alpha= 0.5)
ax[2].set_xlim(150.55,151.4)
ax[2].set_ylim(-33.96,-33.62)
ctx.add_basemap(ax[2], crs={'init': 'epsg:4326'}, zoom=12);
ax[2].set_title('Airbnb locations whose owner has more than 4 rooms')
ax[2].set_xticks([])
ax[2].set_yticks([]);
If we classify owners who operate more than 4 listings as specializing in the Airbnb business, the third plot shows that these professional owners tend to choose the 3 hottest areas or several suburban centres for their business.
Owners who operate 4 or fewer rooms are constrained by the locations of the properties they own, so their listings are more widely distributed.
The answer to the first question, 'Do room type and host listing counts have any regional distribution pattern?':
Compared with entire homes/apartments and shared rooms, private rooms have the largest quantity and the widest distribution. But in some of the hottest areas, especially the northern seaside (from Pittwater to Manly), they are clearly outnumbered by entire homes/apartments.
Professional Airbnb operators like to run their business in the 3 most clustered areas: one reason is the strong demand there, another is the large stock of apartments that can be rented out for this business.
Now that I have seen the big picture of the Sydney Airbnb industry, I will dig into its availability rate and price with a time series analysis.
This time, I will use sc_df to examine the 'available' and 'price' features.
# use datetime as index for better time series analysis
sc_df.set_index('date',inplace = True)
# check null values
print(sc_df.isnull().sum())
print('sc_df has ',sc_df.shape[0], 'rows')
print('sc_df has ',sc_df.shape[0] - sc_df.available.sum(), 'rows are not available')
print('All rooms have average ', (1 - sc_df.available.sum()/sc_df.shape[0])*100, '% time not available')
I can see the count of null prices equals the number of not-available rows (both 9,810,841): when a room is not available there is no price, so the price column is null. I do not need to handle these values, because when I compute the mean price, pandas ignores these rows automatically.
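A minimal illustration of that NaN-skipping behaviour of pandas:

```python
import numpy as np
import pandas as pd

# two valid prices and one missing price, as for a not-available night
prices = pd.Series([100.0, np.nan, 200.0])

# .mean() skips NaN by default (skipna=True), so only the 2 valid rows count
print(prices.mean())              # 150.0
print(prices.mean(skipna=False))  # nan
```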
# get two time series line plots
fig,ax = plt.subplots(2,1,figsize = (10,10))
sc_df.resample('D').price.mean().plot(ax = ax[0],kind='line')
ax[0].set_title('Average Available Airbnb Price')
ax[0].set_ylabel('Average Price $')
ax[0].set_xlabel('')
sc_df.resample('D').available.mean().plot(ax = ax[1], kind='line')
ax[1].set_title('Available Rate')
ax[1].set_ylabel('Average Rate')
ax[1].set_xlabel('')
plt.show();
These two plots contain a lot of information.
Next, I will separate the dataset into two parts and look at the time trends separately.
# check the value_count of neighbourhood
print(sl_df.neighbourhood.value_counts())
# check the location and total percentage of the first 10 neighbourhoods
first_10 = sl_df.neighbourhood.value_counts().index[:10]
area.query('neighbourhood in @first_10').plot()
print('These 10 neighbourhoods totally have ',
(sl_df.neighbourhood.value_counts()/sl_df.shape[0]).values[:10].sum()* 100,
'% airbnb of Sydney.')
The first 10 neighbourhoods are all famous waterside areas, and together they hold more than 3/4 of all Airbnb rooms in Sydney. I will split the dataset here.
sub_list = sl_df.neighbourhood.value_counts().index
# separate the neighbourhoods into two parts at Ryde: neighbourhoods before Ryde are mostly for
# tourists, neighbourhoods from Ryde on are mostly for business and others
sub_H = sub_list[:10]
sub_L = sub_list[10:]
# get the id list of both neighbourhood parts
id_H =sl_df.query('neighbourhood in @sub_H').id.value_counts().index
id_L =sl_df.query('neighbourhood in @sub_L').id.value_counts().index
# split the sc_df dataset
sc_df_H = sc_df.query('listing_id in @id_H')
sc_df_L = sc_df.query('listing_id in @id_L')
# plot the average price and available rate of the first half of the dataset
fig,ax = plt.subplots(4,1,figsize = (10,14))
sc_df_H.resample('D').price.mean().plot(ax = ax[0],kind='line')
ax[0].set_title('Top 10 Areas\' Average Available Airbnb Price')
ax[0].set_ylabel('Average Price $')
ax[0].set_xlabel('')
sc_df_H.resample('D').available.mean().plot(ax = ax[1], kind='line')
ax[1].set_title('Top 10 Areas\' Available Rate')
ax[1].set_ylabel('Average Rate')
ax[1].set_xlabel('')
sc_df_L.resample('D').price.mean().plot(ax = ax[2],kind='line')
ax[2].set_title('Rest Areas\' Average Available Airbnb Price')
ax[2].set_ylabel('Average Price $')
ax[2].set_xlabel('')
sc_df_L.resample('D').available.mean().plot(ax = ax[3], kind='line')
ax[3].set_title('Rest Areas\' Available Rate')
ax[3].set_ylabel('Average Rate')
ax[3].set_xlabel('')
plt.show();
It shows that in the first 10 neighbourhoods, despite the large supply of rooms (more than 75% of the whole city's supply), demand is still very strong.
In the first 10 areas there is more supply, yet prices are still higher and the availability rate lower. This means the demand distribution is very uneven: most of the demand is clustered in a few regions. So for someone wanting to enter this industry, it is better to run the business in the first 10 areas.
I will move to my next question: using machine learning to see which features of the dataset affect the house price more.
# check the columns of the sl_df
sl_df.columns
sl_df actually contains all the quantitative information of the other two datasets (sc_df and sr_df): 'number_of_reviews' is the count of comments in sr_df, and 'availability_365' is the sum of the available column of sc_df. As I want to see what affects the price, I will use 'price' from sl_df as the explained variable and give up the 'price' feature of sc_df.
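The second relationship can be spot-checked on toy data (the real check would group sc_df by listing_id and compare with sl_df):

```python
import pandas as pd

# toy calendar mirroring sc_df, with 'available' already coded as 1/0
sc_toy = pd.DataFrame({'listing_id': [1, 1, 1, 2, 2],
                       'available':  [1, 0, 1, 1, 1]})

# summing 'available' per listing yields an availability count per id,
# which is what 'availability_365' summarises over a full year
avail = sc_toy.groupby('listing_id')['available'].sum()
print(avail.to_dict())  # {1: 2, 2: 2}
```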
# find out all useful columns, and make a copy
cal_df = sl_df[['neighbourhood','room_type','price','minimum_nights','number_of_reviews',
'reviews_per_month','calculated_host_listings_count', 'availability_365']].copy()
cal_df
There are 6 numeric features ('price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365') and 2 categorical features ('neighbourhood' and 'room_type').
cal_df.info()
# check the null values
cal_df.isnull().sum()
# use the sr_df dataset to find how many ids have comments
sr_df.listing_id.value_counts()
These two numbers match: in sl_df, 24,725 rows have number_of_reviews values, exactly the same as the count of ids with any comments.
So any room with no comments will show a null value in the 'reviews_per_month' column of sl_df. I only need to fill these nulls with 0, because those rooms simply have no reviews.
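A small consistency check of that claim on toy frames (the real version would compare sl_df and sr_df directly):

```python
import pandas as pd

# toy frames: listing 3 has no reviews, so its reviews_per_month would be NaN
sl_toy = pd.DataFrame({'id': [1, 2, 3], 'number_of_reviews': [2, 1, 0]})
sr_toy = pd.DataFrame({'listing_id': [1, 1, 2]})

# listings with at least one review should equal distinct ids in the review table
listings_with_reviews = (sl_toy['number_of_reviews'] > 0).sum()
ids_in_reviews = sr_toy['listing_id'].nunique()
print(listings_with_reviews == ids_in_reviews)  # True
```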
# fill the null value with 0
cal_df['reviews_per_month'] = cal_df['reviews_per_month'].fillna(0)
cal_df.isnull().sum()
# set a create_dummy_df function to convert all categorical features into dummy variables
def create_dummy_df(df, cat_cols, dummy_na = False):
    '''
    INPUT:
    df - pandas dataframe with categorical variables you want to dummy
    cat_cols - list of strings that are associated with names of the categorical columns
    dummy_na - Bool holding whether you want to dummy NA vals of categorical columns or not
    OUTPUT:
    df - a new dataframe that has the following characteristics:
        1. contains all columns that were not specified as categorical
        2. removes all the original columns in cat_cols
        3. dummy columns for each of the categorical columns in cat_cols
        4. if dummy_na is True - it also contains dummy columns for the NaN values
        5. uses a prefix of the column name with an underscore (_) for separating
    '''
    for col in cat_cols:
        df = pd.concat([df.drop(col, axis = 1),
                        pd.get_dummies(df[col], prefix = col, prefix_sep = '_',
                                       drop_first = True, dummy_na = dummy_na)],
                       axis = 1)
    return df
# get dummy features from the two categorical features
cat_cols_lst = cal_df.select_dtypes(include = ['object']).columns
# there is no need to deal with null values,
cal_df_new = create_dummy_df(cal_df, cat_cols_lst, dummy_na=False)
cal_df_new.shape
# make a regression
# Split full_df_new into an X matrix and a response vector y
y = cal_df_new['price']
X = cal_df_new.drop('price',axis = 1)
# Create training and test sets of data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2)
# Instantiate a LinearRegression model with normalized data
lm_model = LinearRegression(normalize=True)
# Fit the model to the training data
lm_model.fit(X_train, y_train)
# Predict the response for the training data and the test data
y_test_preds = lm_model.predict(X_test)
y_train_preds = lm_model.predict(X_train)
# get the r2 score of both test data and train data
test_score = r2_score(y_test, y_test_preds)
train_score = r2_score(y_train, y_train_preds)
print('test_score is: ',test_score)
print('train_score is: ',train_score)
The regression shows these features can only explain about 13% of the price variation. I think that's reasonable: features such as room area, indoor facilities, parking space, renovation time, whether meals are provided, traffic convenience, etc., are all very important.
To get a better regression score, I looked for more information in the listings_dec18.csv file. To keep things from getting too complicated, I will add several important features: 'host_is_superhost', 'host_identity_verified', 'accommodates', 'bathrooms', 'bedrooms', 'beds'.
full_list_df = pd.read_csv('listings_dec18.csv')
print(full_list_df.shape)
full_list_df.head()
full_list_df.host_response_time.count()
# get the interesting features
interest_df = full_list_df[['id','host_is_superhost','host_identity_verified','accommodates','bathrooms','bedrooms','beds']]
# join with the sl_df
sl_df2 = sl_df.merge(interest_df,on = 'id',how = 'left')
sl_df2.shape
cal_df = sl_df2[['neighbourhood','room_type','price','minimum_nights','number_of_reviews',
'reviews_per_month','calculated_host_listings_count', 'availability_365',
'host_is_superhost','host_identity_verified','accommodates','bathrooms',
'bedrooms','beds']].copy()
print(cal_df.shape)
cal_df.isnull().sum()
# deal with null values just as above
cal_df['reviews_per_month'] = cal_df['reviews_per_month'].fillna(0)
host_is_superhost, host_identity_verified, bathrooms and beds have very few null values (less than 1 in 1,000), so I will drop these rows.
cal_df = cal_df.dropna(axis = 0, how = 'any')
cal_df.shape
# get dummy features from the two categorical features
cat_cols_lst = cal_df.select_dtypes(include = ['object']).columns
# there is no need to deal with null values,
cal_df_new = create_dummy_df(cal_df, cat_cols_lst, dummy_na=False)
cal_df_new.shape
# make a regression
# Split full_df_new into an X matrix and a response vector y
y = cal_df_new['price']
X = cal_df_new.drop('price',axis = 1)
# Create training and test sets of data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .1)
# Instantiate a LinearRegression model with normalized data
lm_model = LinearRegression(normalize=True)
# Fit the model to the training data
lm_model.fit(X_train, y_train)
# Predict the response for the training data and the test data
y_test_preds = lm_model.predict(X_test)
y_train_preds = lm_model.predict(X_train)
# get the r2 score of both test data and train data
test_score = r2_score(y_test, y_test_preds)
train_score = r2_score(y_train, y_train_preds)
print('test_score is: ',test_score)
print('train_score is: ',train_score)
The regression shows these features can now explain more than 30% of the price variation. That's much better. Finally, I want to see which features matter most within this explained variance.
# examine the coefficients of the regression
def coef_weights(coefficients, X_train):
    '''
    INPUT:
    coefficients - the coefficients of the linear model
    X_train - the training data, so the column names can be used
    OUTPUT:
    coefs_df - a dataframe holding the coefficient, estimate, and abs(estimate)
    Provides a dataframe that can be used to understand the most influential coefficients
    in a linear model by providing the coefficient estimates along with the name of the
    variable attached to the coefficient.
    '''
    coefs_df = pd.DataFrame()
    coefs_df['est_int'] = X_train.columns
    coefs_df['coefs'] = coefficients
    coefs_df['abs_coefs'] = np.abs(coefficients)
    coefs_df = coefs_df.sort_values('abs_coefs', ascending=False)
    return coefs_df
#Use the function
coef_df = coef_weights(lm_model.coef_, X_train)
#A quick look at the top results
coef_df.head(20)
Among the features in this regression, neighbourhood is the most important; the number of bathrooms, accommodates and room type are also important.
My last question is: 'What are the top 5 unique words used by reviewers to describe each neighbourhood?' I will use the sr_df dataset.
# comments are in the sr_df dataset, which has no neighbourhood column; I need to add
# this column by inner joining sl_df and sr_df
id_sub_df = sl_df[['id','neighbourhood']]
sr_df = sr_df.merge(id_sub_df,left_on = 'listing_id', right_on = 'id')
# ensure every comment is a str (a few values are not)
sr_df['comments'] = sr_df['comments'].astype(str)
print(sr_df.shape)
# first, see the average comment word count within each neighbourhood
sr_df['word_count'] = sr_df['comments'].apply(lambda x:len(x.split(' ')))
# get the average comment word count for each neighbourhood
count_df = sr_df.groupby('neighbourhood').word_count.mean().sort_values(ascending=False).reset_index()
plt.figure(figsize = (10,6))
plt.bar(data = count_df, x = 'neighbourhood', height = 'word_count',bottom = 0)
plt.xticks(rotation = 90);
It seems that in neighbourhoods with better views, guests are more likely to write longer comments.
Next, I will look at the content of these comments.
# get the 150 most-used words by reviewers for every neighbourhood, and build a dict
# using the suburb name as the key and the word list as the value
r = re.compile('[a-z\']+')
sw = stopwords.words('english')
stemmer = SnowballStemmer('english')
sub_dic = {}
for sub in sub_list:
    all_word = r.findall(" ".join(sr_df[sr_df['neighbourhood'] == sub]["comments"]).lower())
    stemed_word = [stemmer.stem(i) for i in all_word if i not in sw]
    words, count = zip(*Counter(stemed_word).most_common(150))  # most_common from collections.Counter
    sub_dic[sub] = list(words)
    print(sub, ' finished!')
I want to get every suburb's unique words. I will combine all the words in the 'sub_dic' I just built, form a set of the 150 words that appear most often across all suburbs, then subtract this set from every suburb's word list to get a smaller, more distinctive set.
sub_list = sl_df.neighbourhood.value_counts().index
# deepcopy of sub_dic
sub_dic_copy = copy.deepcopy(sub_dic)
# build one big word list across all suburbs
big_list = []
for sub in sub_list:
    big_list += sub_dic_copy[sub]
# get the 150 most commonly used words
common_words = Counter(big_list).most_common(150)
common_words_list,common_words_count = zip(*common_words)
common_words
Many words appear in the top list of every neighbourhood, such as 'great', 'clean', 'recommend', 'love', 'nice' and 'good'; they show that most guests enjoyed their stay in Airbnb rooms.
# remove these common words to find the unique word list of each neighbourhood
def drop_common(alist, blist = common_words_list):
    '''
    INPUT:
    alist - the list to subtract from
    blist - the list of words to subtract
    OUTPUT:
    list - all items in alist that are not in blist
    '''
    return [i for i in alist if i not in blist]
for sub in sub_list:
    sub_dic_copy[sub] = drop_common(sub_dic_copy[sub])
for sub in sub_list:
    print(sub, sub_dic_copy[sub])
From the list above, the first half of the neighbourhoods have some interesting words left, but for the second half, which have fewer comments, all I get is a list of host names. I will focus on the first half of the neighbourhoods.
Now I can pick the top 5 words manually, after removing some useless words (such as better, less, reserve, cancel, post), for the first 20 neighbourhoods.
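That manual step could be sketched as follows; here useless_words is a hypothetical stop list and sub_dic_toy is a toy stand-in for the real per-suburb word lists:

```python
# hypothetical filler words to discard (my assumption, not derived from the data)
useless_words = {'better', 'less', 'reserve', 'cancel', 'post'}

# toy stand-in for one suburb's word list after removing city-wide common words
sub_dic_toy = {'Pittwater': ['beach', 'better', 'view', 'pool', 'cancel',
                             'deck', 'sand', 'surf']}

# keep the first 5 words of each suburb that survive the stop list
top5 = {sub: [w for w in words if w not in useless_words][:5]
        for sub, words in sub_dic_toy.items()}
print(top5)  # {'Pittwater': ['beach', 'view', 'pool', 'deck', 'sand']}
```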
These words match the character of each neighbourhood. There are several interesting discoveries: